Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

signals to be discovered can only reside on a baseline [Greif, et

al., 2008]. Therefore, estimating the baseline of a spectrum has been

in many areas [Greif, et al., 2008; de Sanctis, et al., 2011;

to, et al., 2019]. To successfully discover the chemicals from a

spectrum, the accuracy of baseline estimation is thus very desirable [Hastie

and Tibshirani, 1990; Price, et al., 2008; Vyumvuhore, et al., 2014;

-Agudelo, et al., 2017; Acikgoz, et al., 2018].

whole-genome pattern discovery problem

DNA sequencing technology has been improved continuously for

decades. Because of this, the high-throughput sequencing data

have played an outstanding role in modern biology research.

Modern DNA sequencing technology or the next-generation

sequencing technology (NGS) can generate sequencing count data for a

in less than an hour’s time. It thus has changed, shaped and

transformed the modern biology/medicine research thoroughly [Hood and

2013]. Among them, the most exciting project is the human

genome project, which has made a huge impact on the genome research

o, 1984; Collins and Galas, 1993; Collins and McKusick, 2001;

et al., 2017; Dunn, et al., 2018].

There are many subjects for the whole-genome pattern discovery based

on the count data. Among them, two are more relevant to the machine

learning concepts. The first one is how to analyse whole genome

sequences for knowledge discovery, i.e., the research of the sequence

homology alignment approaches. It has been at least five decades since the

first sequence homology alignment algorithm was developed [Needleman

and Wunsch, 1970]. Besides, the currently widely used one named

BLAST was developed three decades ago [Altschul, et al., 1990].

earlier homology alignment algorithms align sequences

pairwise and mostly globally. Along with the huge increase of sequencing

data using NGS, the speed of sequence comparison is hugely challenged

especially when comparing a novel sequence against a database of